AITopics | evaluation task

Collaborating Authors

evaluation task

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

525b8410cc8612283c9ecaf9a319f8ed-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 22:18:28 GMT

The gray agent consistently chooses the cyan object over the yellow object (a) . The same gray agent moves to the preferred cyan object (b).

artificial intelligence, machine learning, modeling & simulation, (20 more...)

Neural Information Processing Systems

Technology:

Information Technology > Modeling & Simulation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Neural Program Search with Higher-Order Functions and Lambdas

Neural Information Processing SystemsFeb-16-2026, 05:44:19 GMT

Search is an important technique in program synthesis that allows for adaptive strategies such as focusing on particular search directions based on execution results. Several prior works have demonstrated that neural models are effective at guiding program synthesis searches.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.69)

Add feedback

Regression Planning Networks

Danfei Xu, Roberto Martín-Martín, De-An Huang, Yuke Zhu, Silvio Savarese, Li F. Fei-Fei

Neural Information Processing SystemsFeb-11-2026, 22:49:34 GMT

Performing real-world tasks such as cooking meals or assembling furniture requires an agent to determine long-term strategies. This is often formulated as aplanning problem.

artificial intelligence, machine learning, subgoal, (16 more...)

Neural Information Processing Systems

Country: North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.68)

Add feedback

What is the Best Sequence Length for BABYLM?

Salhan, Suchir, Martinez, Richard Diehl, Goriely, Zébulon, Buttery, Paula

arXiv.org Artificial IntelligenceOct-23-2025

Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.19493

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Learning-To-Measure: In-context Active Feature Acquisition

Kobayashi, Yuta, Jing, Zilin, Yao, Jiayu, Namkoong, Hongseok, Joshi, Shalmali

arXiv.org Artificial IntelligenceOct-15-2025

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness. Machine learning (ML) methods typically operate under the assumption that all input features are available at inference time. However, this assumption does not hold in scenarios where acquiring certain features involves significant costs or risks, such as medical diagnostics (Erion et al., 2022). For example, acquiring imaging data or invasive biopsies may incur substantial financial costs and pose potential risks to patient safety (Callender et al., 2021).

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

2510.12624

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.88)

Add feedback

GmSLM : Generative Marmoset Spoken Language Modeling

Sternberg, Talia, London, Michael, Omer, David, Adi, Yossi

arXiv.org Artificial IntelligenceSep-16-2025

Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primates vocal communication is entirely innate, and show similar features of human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity-especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed a novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguish real from artificial conversations and may support further investigations of the neural basis of vocal communication and provides a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.09198

Country:

North America (0.46)
Asia (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)

Add feedback

Mitigating Catastrophic Forgetting in Continual Learning through Model Growth

Süalp, Ege, Rezaei, Mina

arXiv.org Artificial IntelligenceSep-3-2025

Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models (LLMs) undergoing continual learning, as retaining performance across diverse domains is important for their general utility. In this paper, we explore model growth, a promising strategy that leverages smaller models to expedite and structure the training of larger ones for mitigating the catastrophic forgetting problem. Although growth-based pretraining, particularly via transformer stacking, has shown promise in accelerating convergence, its impact on forgetting remains under-explored. Therefore, we evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias. Our findings show that both models -- one trained with growth (Stack LLM) and one without (LLM) -- exhibit improvements in domain knowledge. However, reasoning and reading comprehension degrade over time, indicating signs of catastrophic forgetting. Stack LLM consistently shows less degradation, especially in reading comprehension, suggesting enhanced retention capabilities. Interestingly, in bias evaluation, the baseline LLM becomes progressively more neutral with continued fine-tuning, while Stack LLM maintains a steady bias ratio around 60--61\%. These results indicate that growth-based pretraining may deliver modest improvements in resisting catastrophic forgetting, though trade-offs remain in handling social biases.

artificial intelligence, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2509.01213

Country:

Europe (0.94)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Tune My Adam, Please!

Athanasiadis, Theodoros, Adriaensen, Steven, Müller, Samuel, Hutter, Frank

arXiv.org Artificial IntelligenceAug-29-2025

The Adam optimizer remains one of the most widely used optimizers in deep learning, and effectively tuning its hyperparameters is key to optimizing performance. However, tuning can be tedious and costly. Freeze-thaw Bayesian Optimization (BO) is a recent promising approach for low-budget hyperparameter tuning, but is limited by generic surrogates without prior knowledge of how hyperparameters affect learning. We propose Adam-PFN, a new surrogate model for Freeze-thaw BO of Adam's hyperparameters, pre-trained on learning curves from TaskSet, together with a new learning curve augmentation method, CDF-augment, which artificially increases the number of available training examples. Our approach improves both learning curve extrapolation and accelerates hyperparameter optimization on TaskSet evaluation tasks, with strong performance on out-of-distribution (OOD) tasks.

artificial intelligence, configuration, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2508.19733

Country: Europe > Germany > Baden-Württemberg (0.29)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Improved Generalized Planning with LLMs through Strategy Refinement and Reflection

Stein, Katharina, Hodel, Nils, Fišer, Daniel, Hoffmann, Jörg, Katz, Michael, Koller, Alexander

arXiv.org Artificial IntelligenceAug-20-2025

LLMs have recently been used to generate Python programs representing generalized plans in PDDL planning, i.e., plans that generalize across the tasks of a given PDDL domain. Previous work proposed a framework consisting of three steps: the LLM first generates a summary and then a strategy for the domain, both in natural language, and then implements that strategy as a Python program, that gets debugged on example planning tasks. In that work, only one strategy is generated and passed directly to the program generation. If the strategy is incorrect, its implementation will therefore result in an incorrect generalized plan. Here, we introduce an approach that generates the strategy in the form of pseudocode and enables automatic debugging of the pseudocode, hence allowing us to identify and fix errors prior to the generation of the generalized plan itself. Additionally, we extend the Python debugging phase with a reflection step prompting the LLM to pinpoint the reason for the observed plan failure. Finally, we take inspiration from LLM code generation to produce several program variants and pick the best one. Running experiments on 17 benchmark domains, we show that these extensions substantially improve (and never deteriorate) the quality of the generalized plans. In 12 of the domains, our best Python programs solve all tasks that can be generated with the respective instance generator.

generator, large language model, natural language, (20 more...)

arXiv.org Artificial Intelligence

2508.13876

Country: